Abstract

The problem of estimating the mean of a normal vector with known but unequal 
variances introduces substantial difficulties that impair the adequacy of traditional 
empirical Bayes estimators. By taking a different approach, that treats the known 
variances as part of the random observations, we restore symmetry and thus the 
effectiveness of such methods. We suggest a group-linear empirical Bayes estimator, 
which collects observations with similar variances and applies a spherically symmetric 
estimator to each group separately. The proposed estimator is motivated by a new 
oracle rule which is stronger than the best linear rule, and thus provides a more 
ambitious benchmark than that considered in previous literature. Our estimator 
asymptotically achieves the new oracle risk (under appropriate conditions) and at 
the same time is minimax. The group-linear estimator is particularly advantageous 
in situations where the true means and observed variances are empirically dependent. 

To demonstrate the merits of the proposed methods in real applications, we analyze 
the baseball data used in Brown (2008), where the group-linear methods achieved 
the prediction error of the best nonparametric estimates that have been applied to 
the dataset, and significantly lower error than other parametric and semi-parametric 
empirical Bayes estimators.

1 Introduction


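Let $X = (X_1, \ldots, X_n)^T$, $\theta = (\theta_1, \ldots, \theta_n)^T$ and
$V = (V_1, \ldots, V_n)^T$, and suppose that

$$X_i \mid (\theta_i, V_i) \sim N(\theta_i, V_i) \tag{1}$$

independently for $1 \le i \le n$, where $\theta$ and $V$ are deterministic. In the
heteroscedastic normal mean problem, the goal is to estimate the vector $\theta$ based on
$X$ and $V$ under the (normalized) sum-of-squares loss

$$L_n(\hat\theta, \theta) = \frac{1}{n} \sum_{i=1}^{n} (\hat\theta_i - \theta_i)^2. \tag{2}$$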


Hence we assume that in addition to the random observations $X_1, \ldots, X_n$, the variances
$V_1, \ldots, V_n$ are available. Allowing the values of $V_i$ to differ from each other extends
the applicability of the homoscedastic Gaussian mean problem to many realistic situations.
A simple but common example is the design corresponding to a one-way homoscedastic
Analysis of Variance with unequal cell counts; here $X_i$ represents the mean of the $n_i$ i.i.d.
$N(\theta_i, \sigma^2)$ observations for the $i$-th sub-population, hence $V_i = \sigma^2/n_i$.
More generally, if $Y \sim N_p(A\beta, \sigma^2 I)$ with a known design matrix $A$, then
estimating $\beta$ under sum-of-squares loss is equivalent to estimating $\theta$ in (1) where
$n = \mathrm{rank}(A)$ and $X_i$ and $V_i/\sigma^2$ are determined by $A$ (see, e.g., Johnstone,
2011, section 2.9). In both cases the $V_i$ are typically known only up to a proportionality
constant, which can be substituted by a consistent estimator.

The normal mean problem has been studied extensively for both the special case of
equal variances, $V_i \equiv \sigma^2$, and the more general case above. Alternative estimators
to the usual minimax estimator $\hat\theta = X$ have been suggested that perform better, for
fixed $n$ or only asymptotically (under some conditions), in terms of the risk
$R_n(\theta, \hat\theta) = E_\theta[L_n(\hat\theta, \theta)]$, regardless of $\theta$. Here and
elsewhere we suppress in notation the dependence of the risk function on $V$.


In the heteroscedastic case there is no agreement between minimax estimators and
existing empirical Bayes estimators regarding how the components of $X$ should be shrunk
relative to their individual variances. Existing parametric empirical Bayes estimators, which
usually start by putting an i.i.d. normal prior on the elements of $\theta$ and therefore shrink
$X_i$ in proportion to $V_i$, are in general not minimax. Conversely, minimax estimators do not
provide substantial reduction in the Bayes risk under such priors, essentially under-shrinking
the components with larger variances, and in some constructions (e.g. Berger, 1976) even
shrink $X_i$ inversely in proportion to $V_i$. Nontrivial spherically symmetric shrinkage
estimators that have been suggested, that is, estimators that shrink all components by the same
factor regardless of $V_i$, are minimax only when the $V_i$ satisfy certain conditions that
restrict how much they can be spread out. See Tan (2015) for a concise review of some existing
estimators and references therein for related literature. Before proceeding, we remark that it
is tempting to scale $X_i$ by $1/\sqrt{V_i}$ in order to make all variances equal; however, after
applying this non-orthogonal transformation the loss needs to be changed accordingly (to a
weighted loss) in order to maintain equivalence between the problems.
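
The remark on rescaling can be spelled out in one line. The following is only an illustrative identity; the symbols $Z_i$, $\eta_i$ and $\hat\eta_i$ are introduced here for the sketch and are not notation from the paper. Writing $Z_i = X_i/\sqrt{V_i}$ and $\eta_i = \theta_i/\sqrt{V_i}$ gives $Z_i \sim N(\eta_i, 1)$, and any estimator $\hat\eta_i$ of $\eta_i$ maps back to $\hat\theta_i = \sqrt{V_i}\,\hat\eta_i$, so that

```latex
\frac{1}{n}\sum_{i=1}^{n} (\hat\theta_i - \theta_i)^2
  = \frac{1}{n}\sum_{i=1}^{n} V_i\,(\hat\eta_i - \eta_i)^2 .
```

The rescaled problem is homoscedastic, but its loss weights coordinate $i$ by $V_i$, which is the weighted loss referred to above.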

There have been attempts to moderate the respective disadvantages of estimators
resulting from either of the two approaches mentioned above. For example, Xie et al. (2012)
consider the family of Bayes estimators arising from the usual hierarchical model

$$\theta_i \overset{\text{i.i.d.}}{\sim} N(\mu, \gamma), \qquad X_i \mid \theta_i \overset{\text{ind.}}{\sim} N(\theta_i, V_i), \qquad 1 \le i \le n \tag{3}$$

and indexed by $\mu$ and $\gamma$. They suggest plugging into the Bayes rule

$$\hat\theta_i^{\mu,\gamma} = X_i - \frac{V_i}{V_i + \gamma}(X_i - \mu) \tag{4}$$

the values $(\hat\mu, \hat\gamma) = \arg\min_{\mu,\gamma} \mathcal{R}(\mu, \gamma; X)$, where
$\mathcal{R}(\mu, \gamma; X)$ is an unbiased estimator of the risk of $\hat\theta^{\mu,\gamma}$.
This reduces the sensitivity of the estimator to how appropriate model (3) is, as compared to
the usual empirical Bayes estimators, which use Maximum Likelihood or Method-of-Moments
estimates of $\mu, \gamma$ under (3). On the other hand, Berger (1982) suggested a modification
of his own minimax estimator (Berger, 1976) that improves Bayesian performance while
retaining minimaxity. Tan (2015) recently suggested a minimax estimator with similar
properties that has a simpler form.
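
As a rough illustration of the plug-in idea for the parametric family (4), the sketch below evaluates an unbiased risk estimate for fixed $(\mu, \gamma)$ and minimizes it over a finite grid. It is a minimal sketch and not the implementation of Xie et al. (2012): the function names and the grid search in `fit_sure_grid` are hypothetical choices made here, and the risk estimate is the one obtained by a direct calculation for a fixed shrinkage factor $b_i = V_i/(V_i + \gamma)$.

```python
import numpy as np

def shrink_parametric(x, v, mu, gamma):
    """Rule (4): X_i - V_i / (V_i + gamma) * (X_i - mu)."""
    x, v = np.asarray(x, float), np.asarray(v, float)
    return x - v / (v + gamma) * (x - mu)

def sure_parametric(x, v, mu, gamma):
    """Unbiased estimate of the normalized risk of rule (4) for fixed (mu, gamma).

    With b_i = V_i / (V_i + gamma) held fixed, a direct calculation gives
    E[(theta_hat_i - theta_i)^2] = E[V_i (1 - 2 b_i) + b_i^2 (X_i - mu)^2].
    The estimate can be negative in a given sample; only its mean is the risk.
    """
    x, v = np.asarray(x, float), np.asarray(v, float)
    b = v / (v + gamma)
    return float(np.mean(v * (1.0 - 2.0 * b) + b**2 * (x - mu) ** 2))

def fit_sure_grid(x, v, mus, gammas):
    """Hypothetical helper: pick (mu, gamma) minimizing the estimate over a grid."""
    best = min((sure_parametric(x, v, m, g), m, g) for m in mus for g in gammas)
    return best[1], best[2]
```

A continuous optimizer could replace the grid; the grid is used here only to keep the sketch dependency-free beyond NumPy.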

While empirical Bayes estimators based on (3) can be constructed so that they asymptotically
dominate the usual estimator (Xie et al., 2012), the modeling of the $\theta_i$ as identically
distributed random variables is often not as well motivated in the heteroscedastic case as it is
in the equal variances case. The assumption that the $\theta_i$ are i.i.d. reflects, as commented
by Efron and Morris (1973b), a “Bayesian statement of belief that the $\theta_i$ are of comparable
magnitude”. But this assumption is not always appropriate. There are many examples where an
association between the $V_i$ and the $\theta_i$ is expected: in Section 5 we consider batting
records for Major League baseball players, where better performing players tend to also have
larger numbers of at-bats (affecting the sampling variances of the observations). In situations
where the true means and the $V_i$ are associated, modeling the $\theta_i$ as i.i.d. is not
adequate. Nevertheless, symmetry can be restored in the heteroscedastic case by treating the
pair $(X_i, V_i)$ as the random data. This observation leads us to develop a block-linear
empirical Bayes estimator that groups together observations with similar variances and applies
a spherically symmetric minimax estimator to each group separately.

The rest of the paper is organized as follows. Section 2 presents the estimation of a 
heteroscedastic mean as a compound decision problem. This motivates the construction 
of a group-linear empirical Bayes estimator in Section 3; we discuss the properties of the 
proposed estimator and prove two oracle inequalities, which establish a sense of asymptotic 
optimality with respect to the class of estimators that are “conditionally” linear. Simulation 
results are reported in Section 4. In Section 5 we apply our estimator to the baseball data
of Brown (2008) and compare it to some of the best-performing estimators that have been
tested on this dataset. Proofs appear in the appendix. 


2 A Compound Decision Problem for the 
Heteroscedastic Case 

Let $X$, $\theta$ and $V$ be as in (1). It is convenient to think of $\theta$ and $V$ as
nonrandom, although the derivations below hold also when $\theta$ or $V$ (or both) are random.
In the sequel we refer to a rule $\hat\theta$ as separable if $\hat\theta_i(X, V) = t(X_i, V_i)$
for some function $t : \mathbb{R} \times \mathbb{R}_+ \to \mathbb{R}$. Denote by $\mathcal{D}_S$
the set of all separable rules. If $\hat\theta \in \mathcal{D}_S$ with
$\hat\theta_i(X, V) = t(X_i, V_i)$, then

$$R_n(\theta, \hat\theta) = \frac{1}{n}\sum_{i=1}^{n} E[t(X_i, V_i) - \theta_i]^2 = E[t(Y, A) - \xi]^2 \tag{5}$$

where the expectation in the last term is taken over the random vector $(Y, \xi, A, I)^T$
distributed according to

$$P(I = i) = 1/n, \qquad (Y, \xi, A) \mid (I = i) \sim (X_i, \theta_i, V_i), \qquad 1 \le i \le n. \tag{6}$$

Above, the symbol “$\sim$” stands for “equal in distribution”. In words, (6) says that
$(\xi, A)$ have the empirical joint distribution of the pairs $(\theta_i, V_i)$, and
$Y \mid (\xi, A) \sim N(\xi, A)$. Throughout the paper, when we refer to the random triple
$(Y, \xi, A)$, its relation to $(X_i, \theta_i, V_i)$, $1 \le i \le n$, is given by (6). The
identity (5), a computation à la Robbins, is easily verified by calculating the expectation on
the right-hand side by first conditioning on $I$. It says that for a separable estimator, the
risk is equivalent to the Bayes risk in a one-dimensional estimation problem.
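
For readers who want the conditioning step behind identity (5) written out, a short sketch follows. Here $\xi$ denotes the second component of the random triple in (6) and $t$ is the function defining the separable rule; nothing beyond (6) is assumed.

```latex
E[t(Y,A) - \xi]^2
  = \sum_{i=1}^{n} P(I=i)\, E\{[t(Y,A) - \xi]^2 \mid I=i\}
  = \frac{1}{n}\sum_{i=1}^{n} E[t(X_i,V_i) - \theta_i]^2 .
```

The second equality uses only that, given $I = i$, the triple $(Y, \xi, A)$ is distributed as $(X_i, \theta_i, V_i)$.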

Now consider $\hat\theta \in \mathcal{D}_S$ with $t$ linear (affine, in point of fact, but with
a slight abuse of terminology we use the former term for convenience) in its first argument,
that is,

$$\hat\theta_i^{a,b}(X, V) = X_i - b(V_i)[X_i - a(V_i)], \qquad 1 \le i \le n \tag{7}$$

for some functions $a$, $b$. The corresponding Bayes risk in the last expression of (5) is

$$r_n(a, b) := E\{Y - b(A)[Y - a(A)] - \xi\}^2. \tag{8}$$

Since

$$Y \mid (\xi, A) \sim N(\xi, A), \tag{9}$$

the minimizers of

$$r_n(a, b \mid v) := E\big[\{Y - b(A)[Y - a(A)] - \xi\}^2 \,\big|\, A = v\big], \tag{10}$$

and hence also of (8), are

$$a_n^*(v) = E(Y \mid A = v), \qquad b_n^*(v) = \frac{v}{\mathrm{Var}(Y \mid A = v)} \tag{11}$$

and the minimum Bayes risk is

$$r_n(a_n^*, b_n^*) = E\big[A\{1 - b_n^*(A)\}\big]. \tag{12}$$

Therefore, (12) is a lower bound on the risk achievable by any estimator of the form (7),
and is the optimal solution within the class. Note that any estimator of the form (4)
is also of the form (7), hence the risk of the best (oracle) rule of the form (7) is no greater
than the risk of the best rule of the form (4). If $\xi$ and $A$ are independent,
$a_n^*(v) = E(Y \mid A = v) = E(\xi \mid A = v) = E(\xi)$, $b_n^*(v) = v/(v + \mathrm{Var}(\xi))$,
and the oracles of the forms (4) and (7) coincide.
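
To make (11) and (12) concrete, the following sketch computes the oracle functions and the oracle risk from the empirical joint distribution of the pairs $(\theta_i, V_i)$, as in (6). It is an illustration under a simplifying assumption made here, namely that the $V_i$ take only a few distinct values, so that conditioning on $A = v$ reduces to grouping indices with equal variance. The function name `oracle_linear` is hypothetical.

```python
import numpy as np

def oracle_linear(theta, v):
    """Oracle a*(v), b*(v) of (11) and risk (12) for the empirical law of (theta_i, V_i).

    Assumes the V_i take a modest number of distinct values, so conditioning on
    A = v amounts to grouping indices with equal V_i.
    Returns two dicts keyed by the distinct variance values, and the oracle risk.
    """
    theta, v = np.asarray(theta, float), np.asarray(v, float)
    a_star, b_star, risk = {}, {}, 0.0
    for val in np.unique(v):
        grp = theta[v == val]
        # Var(Y | A = v) = v + Var(xi | A = v); ndarray.var is the 1/n (population) form.
        var_y = val + grp.var()
        a_star[val] = grp.mean()      # a*(v) = E(Y | A = v) = E(xi | A = v)
        b_star[val] = val / var_y     # b*(v) = v / Var(Y | A = v)
        # Contribution P(A = v) * v * (1 - b*(v)) to E[A {1 - b*(A)}].
        risk += grp.size / theta.size * val * (1.0 - b_star[val])
    return a_star, b_star, risk
```

For instance, with `theta = [1, 3, 0, 0]` and `v = [0.5, 0.5, 2, 2]`, the second group has no spread in its means, so $b^* = 1$ there and it contributes nothing to the oracle risk.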

Finally, we note that existing nonparametric empirical Bayes estimators, such as the
semi-parametric estimator of Xie et al. (2012) and the nonparametric method of Jiang and Zhang
(2010), target the best predictor $g(Y, A)$ of $\xi$ where $g$ is restricted to some
nonparametric class of functions. While the optimal $g$ may indeed be a non-linear function of
$Y$, these methods implicitly assume independence between $\xi$ and $A$, and might still suffer
from the gap between the optimal predictor $g(Y, A)$ assuming independence, and the true Bayes
rule, namely, $E(\xi \mid Y, A)$. Therefore, in some cases the oracle rule of the form (7) might
still have smaller risk than the oracle choice of $g$ computed assuming independence between
$\xi$ and $A$.

3 Group-linear Shrinkage Methods 

Let $X$, $\theta$ and $V$ be as in (1). The estimator in the following lemma will serve as a
building block for our group-linear estimator. Note that in this estimator $\bar X$ is used as
an estimate of the overall group mean. In addition, the estimator is spherically symmetric as a
function of $X - \bar X$. Similar estimators that center on a known mean, and variations, have
been discussed in Brown (1975, Theorem 3), Bock (1975), Berger (1985), Lehmann and Casella
(1998, Theorem 5.7; although there are some typos), Tan (2015) and elsewhere.

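Lemma 1. Let $\hat\theta^{c}$ be an estimator given by $\hat\theta_i^{c} = X_i$ if $n = 1$, and
otherwise

$$\hat\theta_i^{c} = X_i - \hat b\,(X_i - \bar X), \qquad \hat b = \min(1, c_n \bar V / s_n^2) \tag{13}$$

where $\bar X = \sum_{i=1}^{n} X_i/n$, $\bar V = \sum_{i=1}^{n} V_i/n$,
$s_n^2 = \sum_{i=1}^{n} (X_i - \bar X)^2/(n - 1)$ and $c_n$ is a positive constant. Let
$V_{\max} = \max_{i \le n} V_i$ and
$c_n^* = \{[(n - 3) - 2(V_{\max}/\bar V - 1)]/(n - 1)\}_+ = \{1 - 2(V_{\max}/\bar V)/(n - 1)\}_+$.
Then for $0 \le c_n \le 2c_n^*$,

$$\frac{1}{n}\sum_{i=1}^{n} E(\hat\theta_i^{c} - \theta_i)^2 \le \bar V\Big[1 - (1 - 1/n)\,E\big\{(2c_n^* - c_n)\hat b + (2 - 2c_n^* + c_n - s_n^2/\bar V)\, I_{\{s_n^2/\bar V \le c_n\}}\big\}\Big] \le \bar V. \tag{14}$$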

Remarks:

1. The main reason for using $\bar X$ is analytical simplicity. When the $\theta_i$ are all
equal, the MLE of the common mean is the weighted least squares estimate
$(\sum_{i=1}^{n} X_i/V_i)/(\sum_{i=1}^{n} 1/V_i)$.

2. In (14) note that when $s_n^2/\bar V > c_n$,
$(2c_n^* - c_n)\hat b = (2c_n^* - c_n)c_n \bar V/s_n^2$ attains its maximum at $c_n = c_n^*$.
In the homoscedastic case $V_{\max} = \bar V$ and $c_n^* = (n - 3)/(n - 1)$ is the usual
constant for the James-Stein estimator that shrinks toward the sample mean. In the
heteroscedastic case, for a version of the estimator above that shrinks toward zero, a
sufficient condition for minimaxity appears in Tan (2015) as
$0 \le c_n \le 2\{1 - 2(V_{\max}/\bar V)/n\}$. This is consistent with Lemma 1.

3. For one-way unbalanced ANOVA, $V_i = \sigma^2/n_i$ where $\sigma^2$ is the error variance and
$n_i$ is the number of observations for the $i$-th sub-population. Suppose that $\sigma^2$ is
unknown and that we have an unbiased estimator $\hat\sigma^2 = S_k/k$ of $\sigma^2$ independent
of the observations, where $S_k/\sigma^2 \sim \chi_k^2$. Then, replacing $V_i$ in the lemma with
the corresponding estimates $\hat V_i = \hat\sigma^2/n_i$, the same conclusion still holds with
$0 \le c_n(1 + 2/k) \le 2c_n^*$.
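
The estimator (13) is short enough to write down directly. The sketch below is a minimal NumPy rendering under the definitions of Lemma 1, with the default constant taken to be $c_n^*$; the function name `shrink_to_mean` and the handling of the degenerate case $s_n^2 = 0$ are choices made here for illustration.

```python
import numpy as np

def shrink_to_mean(x, v, c=None):
    """Spherically symmetric shrinkage toward the sample mean, as in (13).

    x, v : observations and their known variances (1-d arrays of equal length).
    c    : constant c_n; defaults to c_n^* = {1 - 2 (V_max / V_bar) / (n - 1)}_+.
    """
    x, v = np.asarray(x, float), np.asarray(v, float)
    n = x.size
    if n == 1:
        return x.copy()               # Lemma 1: the estimator is X_i when n = 1
    x_bar, v_bar = x.mean(), v.mean()
    s2 = np.sum((x - x_bar) ** 2) / (n - 1)
    if c is None:
        c = max(0.0, 1.0 - 2.0 * (v.max() / v_bar) / (n - 1))
    # b_hat = min(1, c * V_bar / s^2). If s^2 == 0 every X_i already equals X_bar,
    # so any b_hat returns X; we set b_hat = 1 to avoid dividing by zero.
    b_hat = 1.0 if s2 == 0 else min(1.0, c * v_bar / s2)
    return x - b_hat * (x - x_bar)
```

In the homoscedastic case with $n = 5$ the default constant is $(n - 3)/(n - 1) = 0.5$, which matches the second remark above.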

We are now ready to introduce an empirical Bayes estimator, which employs the
spherically symmetric estimator of Lemma 1 to mimic the oracle rule $\hat\theta^{a^*,b^*}$.
When the number of distinct values $V_i$ is very small compared to $n$, a natural competitor of
$\hat\theta^{a^*,b^*}$ is obtained by applying a James-Stein estimator separately to each group
of homoscedastic observations. Under appropriate conditions, this estimator asymptotically
approaches the oracle risk (12). The situation in the general heteroscedastic problem, when the
number of distinct values $V_i$ is not very small compared to $n$, is not as obvious; still, the
expression for the optimal functions $a^*$ and $b^*$ in (11) suggests grouping together
observations with similar variances $V_i$, and then applying a spherically symmetric estimator
separately to each group.

Block-linear shrinkage has been suggested before for the homoscedastic case by Cai (1999) 
in the context of asymptotic adaptive wavelet estimation. However, the estimator of Cai 
(1999) is motivated from an entirely different perspective, and addresses a very different 
oracle rule (itself a blockwise rule) from the oracle associated with our procedure. See also 
Ma et al. (2015). For the heteroscedastic case, Tan (2014) comments briefly that block
shrinkage methods building on a minimax estimator can be considered to allow different 
shrinkage patterns for observations with different sampling variances; this is very much in 
line with our approach. 

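Definition 1 (Group-linear Empirical Bayes Estimator for a Heteroscedastic Mean). Let
$J_1, \ldots, J_m$ be disjoint intervals. For $k = 1, \ldots, m$ denote

$$I_k = \{i : V_i \in J_k\}, \quad n_k = |I_k|, \quad \bar X_k = \sum_{i \in I_k} \frac{X_i}{n_k}, \quad \bar V_k = \sum_{i \in I_k} \frac{V_i}{n_k}, \quad s_k^2 = \sum_{i \in I_k} \frac{(X_i - \bar X_k)^2}{n_k \vee 2 - 1}.$$

Define a corresponding group-linear estimator $\hat\theta^{GL}$ componentwise by

$$\hat\theta_i^{GL} = \begin{cases} X_i - \min(1, c_k \bar V_k/s_k^2)\,(X_i - \bar X_k) & \text{if } i \in I_k, \\ X_i & \text{otherwise,} \end{cases} \tag{15}$$

and note that $\hat\theta_i^{GL} = X_i$ when $V_i \notin \cup_{k=1}^{m} J_k$ or $V_i \in J_k$ for
some $k$ with $c_k = 0$.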
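
Definition 1 can be sketched in a few lines of NumPy. The code below is an illustration rather than the authors' implementation: the intervals are passed as an increasing array of edges and treated as half-open (an assumed convention), the constants $c_k$ are taken to be those of Theorem 1, and singleton or empty groups are simply left unshrunk. The function name `group_linear` is hypothetical.

```python
import numpy as np

def group_linear(x, v, edges):
    """Group-linear estimator in the spirit of Definition 1.

    edges : increasing array; consecutive entries delimit half-open intervals
            [lo, hi) (an assumed convention). Observations whose V_i falls
            outside all intervals are left as X_i.
    Uses c_k = {1 - 2 (max_{i in I_k} V_i / V_bar_k) / (n_k - 1)}_+ as in Theorem 1.
    """
    x, v = np.asarray(x, float), np.asarray(v, float)
    est = x.copy()
    # np.digitize returns 0 below edges[0] and len(edges) at or above edges[-1];
    # index k in 1..len(edges)-1 means edges[k-1] <= V_i < edges[k].
    k_of_i = np.digitize(v, edges)
    for k in range(1, len(edges)):
        idx = np.flatnonzero(k_of_i == k)
        n_k = idx.size
        if n_k < 2:
            continue                  # empty or singleton group: keep X_i
        xk, vk = x[idx], v[idx]
        v_bar = vk.mean()
        c_k = max(0.0, 1.0 - 2.0 * (vk.max() / v_bar) / (n_k - 1))
        s2 = np.sum((xk - xk.mean()) ** 2) / (n_k - 1)
        if c_k == 0.0 or s2 == 0.0:
            continue                  # no shrinkage, or nothing to shrink
        b_hat = min(1.0, c_k * v_bar / s2)
        est[idx] = xk - b_hat * (xk - xk.mean())
    return est
```

Because the bins depend only on the $V_i$, each group is processed by the same rule as in Lemma 1, applied to that group alone.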

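Theorem 1. For $\hat\theta = \hat\theta^{GL}$ in Definition 1 with
$c_k = \{1 - 2(\max_{i \in I_k} V_i/\bar V_k)/(n_k - 1)\}_+$, the following holds:

1. Under the Gaussian model (1) with deterministic $(\theta_i, V_i)$, $i \le n$, the risk of
$\hat\theta$ is no greater than that of the naive estimator $X$, and therefore $\hat\theta$ is
minimax:

$$\frac{1}{n}\sum_{i=1}^{n} E(\hat\theta_i - \theta_i)^2 \le \frac{1}{n}\sum_{i=1}^{n} V_i = \bar V. \tag{16}$$

2. Let $(X_i, \theta_i, V_i)$, $i = 1, \ldots, n$, be i.i.d. vectors from any fixed (with respect
to $n$) population satisfying (1). Let $(Y, \xi, A)$ be defined by (6); $r(a, b)$ as defined in
(8); and $a^*$ and $b^*$ as defined in (11). Then

$$\frac{1}{n}\sum_{i=1}^{n} E\big[(\hat\theta_i - \theta_i)^2 \,\big|\, V\big] \le \frac{1}{n}\sum_{i=1}^{n} r(a^*, b^* \mid V_i) + o(1) \tag{17}$$

with $V = (V_1, \ldots, V_n)$ and for any sequence $V_1, V_2, \ldots$ such that the following
holds: with $|J|$ being the length of interval $J$,

$$\max_{1 \le k \le m} |J_k| \to 0, \qquad \min_{1 \le k \le m} n_k \to \infty, \qquad a^*(v),\ b^*(v) \text{ are uniformly continuous},$$

$$\limsup_{n \to \infty} \frac{\sum_{i=1}^{n} V_i}{n} < \infty, \qquad \limsup_{n \to \infty} \frac{\sum_{i=1}^{n} V_i\, I_{\{V_i \notin \cup_{k=1}^{m} J_k\}}}{n} = 0. \tag{18}$$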

Remarks on the second part of the theorem:

1. Note that when the $(X_i, \theta_i, V_i)$ are i.i.d., each triple is distributed as
$(Y, \xi, A)$. We assumed that the ‘population’ distribution $(Y, \xi, A)$ itself does not
depend on $n$ (in which case $r(a, b)$ and $a^*, b^*$ indeed do not depend on $n$). A similar
statement would still hold when the distribution of $(Y, \xi, A)$ depends on $n$, under the
conditions that $\{a_n^*\}$, $\{b_n^*\}$ are equicontinuous and $\{a_n^*\}$ is uniformly bounded
for any given finite interval. Although not considered here, an analogue of the second part of
the theorem could be stated for the nonrandom situation,
$X_i \mid (\theta_i, V_i) \sim N(\theta_i, V_i)$, $1 \le i \le n$, with deterministic $\theta_i$
and $V_i$. In this case, suppose that the empirical joint distribution $G_n$ of
$\{(\theta_i, V_i) : 1 \le i \le n\}$ has a limiting distribution $G$. Then if we define the
risk for candidates $a_n, b_n$ to be computed with respect to $G$, our estimator enjoys
$r(a_n, b_n) \to r(a^*, b^*)$ under appropriate conditions on $a^*, b^*$.

2. The continuity of the shrinkage factor and location, $b^*(v)$ and $a^*(v)$, allows borrowing
strength from neighboring observations with similar variances. To asymptotically mimic the
performance of the oracle rule, $\max_{1 \le k \le m} |J_k| \to 0$ and
$\min_{1 \le k \le m} n_k \to \infty$ are necessary wherever shrinkage is needed. The only
intrinsic assumption is $\limsup_{n \to \infty} \sum_{i=1}^{n} V_i/n < \infty$, essentially
‘equivalent’ to bounded expectation of $A$. It ensures that
$\max_{1 \le k \le m} |J_k| \to 0$ and $\min_{1 \le k \le m} n_k \to \infty$ are satisfied when
the intervals $J_k$ are chosen to cover most of the observations, and at the same time
$\limsup_{n \to \infty} \sum_{i=1}^{n} V_i\, I_{\{V_i \notin \cup_{k=1}^{m} J_k\}}/n = 0$, which
takes care of the remaining observations (large or isolated $V_i$), and guarantees that their
contribution to the risk is negligible.

3. A statement on Bayes risk, when expectation is taken over V in (17), can be obtained 
in a similar way by replacing the conditions on V with bounded expectation of the 
random variable A. We skip this for simplicity. 

For the i.i.d. situation of the second part of Theorem 1, the case $r(a^*, b^*) = 0$ corresponds
to $\xi = a^*(A)$, a deterministic function of $A$ (equivalently, $b^*(A) = 1$). In this case the
precision in estimating the function $a^*$ is crucial, and calls for a sharper result than (17)
regarding the rate of convergence of the excess risk. Noting that, trivially, $\xi = a^*(A)$
implies that $E(\xi \mid A = v) = a^*(v)$, $X_i \mid V_i \sim N(a^*(V_i), V_i)$ is a nonparametric
regression model, i.e., $\theta_i$ is a deterministic measurable function of $V_i$. In this
case, the rate of convergence in (17) depends primarily on the smoothness of the function
$a^*(v)$. In the homoscedastic case the smoothing feature of the James-Stein estimator was
studied in Li and Hwang (1984). The following theorem states that the group-linear estimator
attains the optimal convergence
rate under a Lipschitz condition, at least when $A$ is bounded.

Theorem 2. Let $(X_i, \theta_i, V_i)$, $i = 1, \ldots, n$, be i.i.d. vectors from a population
satisfying (1). If $r(a^*, b^*) = 0$ and $a^*(\cdot)$ is $L$-Lipschitz continuous, then the
group-linear estimator in Definition 1 with equal block size $|J_k| = |J| = [\ldots]$ satisfies

$$\frac{1}{n}\sum_{i=1}^{n} E\big[(\hat\theta_i - \theta_i)^2 \,\big|\, V\big] \le 2\,[\ldots] \tag{19}$$

for any deterministic sequence $V = (V_1, \ldots, V_n)$.


For the asymptotic results in Theorems 1 and 2 to hold, it is enough to choose bins $J_k$ of
equal length $|J| = [\ldots]$. However, in realistic situations, where $n$ is some fixed number,
other strategies for binning observations according to the $V_i$ might be more sensible. For
example, by Lemma 1 and the first remark that follows it, bins that keep
$(\max\{V_i : i \in I_k\})/\bar V_k$ (rather than
$\max\{V_i : i \in I_k\} - \min\{V_i : i \in I_k\}$) approximately fixed may be more appropriate.
Hence we propose to bin observations into windows of equal length in $\log(V_i)$ instead of
$V_i$. Furthermore, instead of the constant multiplying $[\ldots]$ in $|J|$, which may be
suitable when the $V_i \in (0, 1]$, we suggest in general to fix the number of bins to
$[\ldots]$, i.e., divide $\log(V_i)$ into bins of equal length
$[\max_i(\log V_i) - \min_i(\log V_i)]/[\ldots]$. On a finer scale, for a given choice of
$\{J_k\}$, there is also the question of whether any two groups should be combined, and the
shrinkage factors adjusted accordingly; this issue arises even in the homoscedastic case (Efron
and Morris, 1973a). Note that, trivially, minimaxity is preserved when the values of $V_i$, but
not $X_i$, are used to choose the bins $J_k$.
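
The log-scale binning proposed above can be sketched as follows. The number of bins is deliberately left as an argument, since the specific choice is not restated here; the function name `log_variance_edges` and the small upward nudge of the last edge (so that the largest variance falls inside a half-open bin) are assumptions of this sketch.

```python
import numpy as np

def log_variance_edges(v, n_bins):
    """Edges of n_bins equal-length intervals covering the range of log(V_i).

    n_bins is left to the caller. Returned edges are on the original V scale,
    so consecutive edges have a constant ratio rather than a constant difference.
    """
    v = np.asarray(v, float)
    log_v = np.log(v)
    edges = np.exp(np.linspace(log_v.min(), log_v.max(), n_bins + 1))
    # Guard against rounding in exp(log(.)): pin the first edge to min(V_i) and
    # push the last edge just above max(V_i) so half-open bins [lo, hi) cover all data.
    edges[0] = v.min()
    edges[-1] = np.nextafter(v.max(), np.inf)
    return edges
```

Only the $V_i$ enter the construction, in line with the closing note of the paragraph above on preserving minimaxity.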

As for performance of the group-linear estimator for fixed $n$, some situations are certainly
harder than others. In the best scenario, where the variances are clustered at a fixed finite
set of possible values, the method is expected to work very well with fast convergence in
(17). Otherwise, the method is expected to work reasonably well in the sense of (17) when
$\max V_i/\min V_i$ is not too large, whether the distribution of $V_i$ is continuous or not,
because the large clusters will benefit from shrinkage and small clusters will have small total
contribution to the risk due to minimaxity within each group. Still, the difference between the
two cases could be nontrivial in finite samples. In the third and worst case scenario, the
sequence of variances is rapidly increasing so that the benefit of grouping is small for a large
fraction of relatively large variances. This could also happen when the variances are small, as
the risk ratio between the group and naive estimators depends only on the ratios
$V_i/V_{\max}$.

4 Simulation Study 

In this section we carry out a simulation study using the examples of Xie et al. (2012),
and compare the performance of our group-linear estimator to the methods proposed in
their work. In each example, we draw $n$ i.i.d. triples
$(X_i, \theta_i, V_i) \sim (Y, \xi, A)$ such that $Y \mid (\xi, A) \sim N(\xi, A)$; the last
example is the only exception, with $Y \mid (\xi, A)$ not normally distributed, in order
to assess the sensitivity to departures from normality. Various estimators are then applied
to the data $(X_i, V_i)$, $1 \le i \le n$, and the normalized sum of squared errors is computed.
For each value of $n$ in $\{20, 40, 60, \ldots, 500\}$, this process is repeated
$N = 10{,}000$ times to obtain a good estimate of the (Bayes) risk for each method. Among the
empirical Bayes estimators proposed by Xie et al. (2012) we consider the parametric SURE
estimator given by

$$\hat\theta_i^{M} = X_i - \frac{V_i}{V_i + \hat\gamma}(X_i - \hat\mu), \qquad 1 \le i \le n$$

where $\hat\gamma$ and $\hat\mu$ minimize an unbiased estimator of the risk (SURE) for estimators
of the form $\hat\theta_i^{\mu,\gamma} = X_i - [V_i/(V_i + \gamma)](X_i - \mu)$ over $\mu$ and
$\gamma$. We also consider the semi-parametric SURE estimator of Xie et al. (2012) with
shrinkage towards the grand mean, defined by

$$\hat\theta_i^{SG} = X_i - \hat b_i (X_i - \bar X), \qquad 1 \le i \le n \tag{20}$$

Table 1: Oracle shrinkage locations and shrinkage factors, $\mu^*, v/(v + \gamma^*)$ and
$a^*(v), b^*(v)$, corresponding to the family of estimators of Xie et al. (equation (21)) and to
the family of estimators that are linear in $Y$ (equation (22)). Columns correspond to
simulation examples (a)-(f). Values of $\mu^*, \gamma^*$ for each example are from Xie et al.
(2012).

                      (a)          (b)                (c)                (d)                  (e)                 (f)
mu*, v/(v+gamma*)     0, v/(v+1)   0.5, v/(v+0.083)   0.6, v/(v+0.078)   0.13, v/(v+0.0032)   0.15, v/(v+0.84)    0.6, v/(v+0.078)
a*(v), b*(v)          0, v/(v+1)   0.5, v/(v+0.083)   v, 0               v, 0                 2 I(v = 0.1), 0.5   v, 0

where $\hat b = (\hat b_1, \ldots, \hat b_n)$ minimize an unbiased estimator of the risk for
estimators of the form $\hat\theta_i^{b} = X_i - b_i(X_i - \bar X)$ with
$b = (b_1, \ldots, b_n)$ restricted to satisfy $b_i \le b_j$ whenever $V_i \le V_j$.
The group-linear estimator $\hat\theta^{GL}$ of Definition 1 is applied here with the bins $J_k$
formed by dividing the range of $\log(V_i)$ into equal length intervals, per the discussion
concluding Section 3. As benchmarks, in each example we also compute the two oracle risks

$$r(\mu^*, \gamma^*) = \min_{\mu, \gamma \in \mathbb{R} \,:\, \gamma \ge 0} E\Big\{\Big[Y - \frac{A}{A + \gamma}(Y - \mu) - \xi\Big]^2\Big\} \tag{21}$$

and

$$r(a^*, b^*) = \min_{a(\cdot),\, b(\cdot) \,:\, a(v) \ge 0\ \forall v} E\big\{[Y - b(A)(Y - a(A)) - \xi]^2\big\} \tag{22}$$


corresponding to the optimal rule in the parametric family of estimators considered in Xie
et al. (2012, labeled “XKB oracle” in the legend of Figure 1), and to the optimal
linear-in-$X$ rule of Section 2, respectively. Note that $\mu^*$ and $\gamma^*$ are numbers
whereas $a^*$ and $b^*$ are functions. Table 1 displays the oracle shrinkage locations and
shrinkage factors corresponding to (21) and (22); note that $v/(v + \gamma^*)$ is strictly
increasing in $v$, while $b^*(v)$ is not necessarily.

Figure 1 shows the average loss across the $N = 10{,}000$ repetitions for the parametric
SURE, semi-parametric SURE and the group-linear estimators, plotted against the different
values of $n$. The horizontal line corresponds to $[\ldots]$. The general picture arising from
the simulation examples is consistent with our expectation that the limiting risk of the
group-linear estimator is smaller than that of both the parametric SURE estimator, as
$r(a^*, b^*) \le r(\mu^*, \gamma^*)$, and the semi-parametric SURE estimator, as
$r(a^*, b^*) \le \inf\{r(a, b) : b(v) \text{ monotone increasing in } v\}$. For moderate $n$,
whenever $\xi$ and $A$ are independent, the SURE estimators are appropriate and achieve smaller
risk. By contrast, the situations where $\xi$ and $A$ are dependent are handled best by the
group-linear estimator, which indeed achieves significantly smaller risk than both SURE
estimators.


[Figure 1, recovered panel titles: (a) $A \sim \mathrm{Unif}(0.1, 1)$, $\xi \sim N(0, 1)$; (b) $A \sim \mathrm{Unif}(0.1, 1)$, $\xi \sim \mathrm{Unif}(0, 1)$.]

Figure 1: Estimated risk for various estimators vs. number of observations.
In example (a) (7.1 of Xie et al., 2012) $A \sim \mathrm{Unif}(0.1, 1)$ and $\xi \sim N(0, 1)$,
independently. In this case, the linear Bayes rule is of the form (4) and, in particular, the
functions $a^*$ and $b^*$ are constant in $v$. The parametric SURE estimator is therefore
appropriate, and it performs best, requiring estimation of only two hyperparameters. The
group-linear estimator and the semi-parametric SURE perform comparably across values of $n$.
Here $r(\mu^*, \gamma^*)$, $r(a^*, b^*)$ and the limiting risks of the parametric SURE and the
group-linear estimator are all equal ($\approx .3357$). In example (b) (7.2 of Xie et al.,
2012), $A \sim \mathrm{Unif}(0.1, 1)$ and $\xi \sim \mathrm{Unif}(0, 1)$, independently. This
situation is not very different from the first example when it comes to comparing the SURE
estimators to the group-linear, since the functions $a^*$ and $b^*$ are constant in $v$ as long
as $\xi$ and $A$ are independent. The risk of the group-linear approaches the oracle risk
($\approx .0697$), but here the semi-parametric SURE estimator seems to do a little better,
perhaps in part because it (correctly) shrinks all data points toward exactly the same
location.

The third example (c) (7.3 of Xie et al., 2012) takes $A \sim \mathrm{Unif}(0.1, 1)$,
$\xi = A$. Here $\xi$ and $A$ are strongly dependent, and indeed the gap between the two oracle
risks, $r(\mu^*, \gamma^*) \approx .0540$ and $r(a^*, b^*) = 0$, is material. The advantage of
the group-linear estimator over the SURE estimators is seen already for moderate values of $n$.
Although it is hard to tell from the figure, the limiting risk of the semi-parametric SURE is
slightly smaller than that of the parametric SURE, because of the improved capability of the
semi-parametric oracle to accommodate the dependence between $\xi$ and $A$. In the fourth case
(d) (7.4 of Xie et al., 2012) we take $A \sim \text{Inv-}\chi^2_{10}$, $\xi = A$. Here $\xi$ is
still a deterministic function of $A$, but it takes larger values of $n$ for the group-linear
estimator to outperform the SURE estimators. This is not seen before $n = 500$, which seems to
be a consequence of the non-uniform distribution of the $V_i$, and is only partially mitigated
by binning according to $\log(V_i)$.

Example (e) (7.5 of Xie et al., 2012) reflects grouping: $A$ equals 0.1 or 0.5 with equal
probability; $\xi \mid (A = 0.1) \sim N(2, 0.1)$ and $\xi \mid (A = 0.5) \sim N(0, 0.5)$. In each
of the two variance groups, the group-linear estimator reduces to a (positive-part) James-Stein
estimator, and performs significantly better than the SURE estimators. While not plotted in the
figure, the other semi-parametric SURE estimator of Xie et al. (2012), which uses a SURE
criterion to choose also the shrinkage location, achieves significantly smaller risk than the
SURE estimators considered here; still, its limiting risk is 16% higher than that of the
group-linear.

Lastly, in (f) (7.6 of Xie et al., 2012) $A \sim \mathrm{Unif}(0.1, 1)$, $\xi = A$ and
$Y \mid (\xi, A) \sim \mathrm{Unif}(\xi - \sqrt{3A}, \xi + \sqrt{3A})$, violating the normality
assumption for the data. The group-linear estimator is again seen to outperform the SURE
estimators starting at relatively small values of $n$, and its risk still tends to the oracle
risk $r(a^*, b^*) = 0$. By contrast, the risk of the parametric SURE estimator approaches
$r(\mu^*, \gamma^*) = 0.054$. The semi-parametric SURE estimator does just a little better, its
risk approaching 0.0423.

5 Real Data Example 

We now turn to a real data example to test our group-linear methods. We use the popular
baseball dataset from Brown (2008), which contains batting records for all Major League
baseball players in the 2005 season. As in Brown (2008), the entire season is split into two
periods, and the task is to predict the batting averages of individual players in the second
half-season based on records from the first half-season. Denoting by $H_{ji}$ the number of hits
and by $N_{ji}$ the number of at-bats for player $i$ in period $j$ of the season, it is assumed
that

$$H_{ji} \sim \mathrm{Bin}(N_{ji}, p_i), \qquad j = 1, 2, \quad i = 1, \ldots, n. \tag{23}$$

As suggested in Brown (2008), a variance-stabilizing transformation is first applied,
$X_{ji} = \arcsin\{(H_{ji} + 1/4)^{1/2}/(N_{ji} + 1/2)^{1/2}\}$, resulting in
$X_{ji} \sim N(\theta_i, 1/(4N_{ji}))$, $\theta_i = \arcsin(\sqrt{p_i})$, and
$\{(X_{1i}, N_{1i}) : i = 1, \ldots, n\}$ are then used to estimate the means $\theta_i$. We
should remark that there is no reason for using this transformation, and for focusing on
estimating $\theta_i$ instead of $p_i$, other than making the data (approximately) fit the
heteroscedastic normal model (note that the variance of the obvious statistic $H_{ji}/N_{ji}$
depends explicitly on $p_i$, and therefore is not suitable). Indeed, one might object to
analyzing the baseball data using a normal model instead of using the binomial model (23)
directly (as in Muralidharan, 2010). Our only response is that the purpose of our analysis is
primarily to illustrate the possible advantages of the group-linear estimator, and more
generally of methods that can accommodate statistical dependence between the means and the known
variances, in the heteroscedastic normal problem.
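
The transformation described above is easy to apply in code. The sketch below follows the formula as stated in the text; the function name `arcsine_transform` and the choice to return the approximate variance alongside the transformed value are conveniences assumed here.

```python
import numpy as np

def arcsine_transform(hits, at_bats):
    """Variance-stabilizing transform for binomial counts, as described above.

    Returns X = arcsin(sqrt((H + 1/4) / (N + 1/2))) together with the
    approximate sampling variance 1 / (4 N) used as the known variance V.
    """
    h, n = np.asarray(hits, float), np.asarray(at_bats, float)
    x = np.arcsin(np.sqrt((h + 0.25) / (n + 0.5)))
    return x, 1.0 / (4.0 * n)
```

For a player with 5 hits in 10 at-bats the argument of the square root is exactly 1/2, so the transformed value is $\pi/4$ and the variance is 0.025.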


To measure the performance of an estimator $\hat\theta$, we use the Total Squared Error,
$\mathrm{TSE}(\hat\theta) = \sum_i [(X_{2i} - \hat\theta_i)^2 - 1/(4N_{2i})]$, proposed by Brown
(2008) as an (approximately) unbiased estimator of the risk of $\hat\theta$. Following Brown
(2008), only players with at least 11 at-bats in the first half-season are considered in the
estimation process, and only players with at least 11 at-bats in both half-seasons are
considered in the validation process, namely, when evaluating the TSE. To support our
comparison, in addition to the analysis for the original data, we present an analysis under a
permutation of the order in which successful hits appear throughout the entire season: for each
player we draw the number of hits in the $N_{1i}$ at-bats of the first period from a
Hypergeometric distribution, $\mathcal{HG}(N_{1i} + N_{2i}, H_{1i} + H_{2i}, N_{1i})$. In the
permutation analysis we concentrate on the two SURE methods of Xie et al. (2012), which
we consider as the main competitors of our method; the extended James-Stein estimator;
and the group-linear estimators.
Table 2 shows TSE for various estimators reported in Table 2 of Xie et al. (2012),
when applied separately to (i) all players, (ii) pitchers only and (iii) non-pitchers only. The
values displayed are fractions of the TSE of the naive estimator, which, in each of the cases
(i)-(iii), simply predicts $X_{2i}$ by $X_{1i}$. Numbers in parentheses correspond to permuted
data, and were computed as the average of the relative TSE over 1000 rounds of shuffling as
described above.

Table 2: Prediction Errors of Transformed Batting Averages. For the five estimators at the
bottom of the table, numbers in parentheses are estimated TSE for permuted data.

Estimator                          All             Pitchers        Non-pitchers
Naive                              1               1               1
Grand mean                         .852            .127            .378
Nonparametric EB                   .508            .212            .372
Binomial mixture                   .588            .156            .314
Weighted Least Squares             1.07            .127            .468
Weighted nonparametric MLE         .306            .173            .326
Weighted Least Squares (AB)        .537            .087            .290
Weighted nonparametric MLE (AB)    .301            .141            .261
James-Stein                        .535 (.543)     .165 (.239)     .348 (.234)
SURE $\hat\theta^{M}$              .421 (.484)     .123 (.211)     .289 (.265)
SURE $\hat\theta^{SG}$             .408 (.468)     .091 (.169)     .261 (.219)
Group-linear $\hat\theta^{GL}$     .302 (.280)     .178 (.244)     .325 (.175)
Group-linear (dynamic)             .288 (.276)     .168 (.283)     .349 (.175)


In the table, the Grand mean estimator uses the simple average of all $X_{1i}$; the extended
positive-part James-Stein estimator is given by $\hat\theta_i^{JS+} = \hat\mu_{JS+} + \left(1 - \frac{n-3}{\sum_i (X_i - \hat\mu_{JS+})^2/V_i}\right)_+ (X_i - \hat\mu_{JS+})$,
where $\hat\mu_{JS+} = (\sum_i X_i/V_i)/(\sum_i 1/V_i)$; $\hat\theta^{M}$ is the parametric empirical Bayes estimator of
Xie et al. (2012) using the SURE criterion to choose both the shrinkage and the location
parameter; and $\hat\theta^{SG}$ is the semi-parametric SURE estimator of Xie et al. (2012) that shrinks
towards the grand mean. Also included in the table are the nonparametric shrinkage methods
of Brown and Greenshtein (2009); the weighted least squares estimator; the nonparametric
maximum likelihood estimators of Jiang and Zhang (2009, 2010) (with and without number
of at-bats as covariate) and the binomial mixture estimator of Muralidharan (2010).
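As a rough illustration of the James-Stein benchmark, the sketch below shrinks each observation toward a precision-weighted mean by a common positive-part factor. It assumes the form used by Xie et al. (2012), with shrinkage factor $1 - (n-3)/\sum_i (X_i - \hat\mu)^2/V_i$; the function name `james_stein_plus` and the clamping of the factor to $[0, 1]$ for very small $n$ are choices made here, not taken from the paper.

```python
def james_stein_plus(x, v):
    """Extended positive-part James-Stein estimate for unequal variances.

    x: observations, v: known variances (same length, intended for n > 3).
    """
    n = len(x)
    weights = [1.0 / vi for vi in v]
    # Precision-weighted location estimate.
    mu = sum(w * xi for w, xi in zip(weights, x)) / sum(weights)
    # Weighted sum of squares around the location estimate.
    ss = sum((xi - mu) ** 2 / vi for xi, vi in zip(x, v))
    if ss == 0.0:
        return [mu] * n
    # Positive-part shrinkage factor, clamped to [0, 1].
    factor = min(1.0, max(0.0, 1.0 - (n - 3) / ss))
    return [mu + factor * (xi - mu) for xi in x]

estimates = james_stein_plus([0.0, 1.0, 2.0, 3.0, 4.0], [1.0] * 5)
```

Because a single factor is applied to every coordinate, the rule cannot adapt when the amount of shrinkage that is appropriate varies with $V_i$, which is the situation the group-linear estimator targets.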

For the group-linear estimator, in addition to the plain estimator $\hat\theta^{GL}$ that uses
$k = \lceil n^{1/3} \rceil$ equal length bins on $\log(V_i)$ (as in the simulation study), we considered
a data-dependent strategy for binning. The estimator labeled "dynamic" in Table 2
chooses, among all partitions of the data into contiguous bins containing no more than
$n^{2/3}$ observations each, the partition which minimizes an unbiased estimate of the risk
of the corresponding group-linear estimator. This can be viewed as an extension of the
plain version, which for uniformly spaced data would put $\approx n^{2/3}$ observations in each
of $n^{1/3}$ bins. Our implementation uses dynamic programming (code available online at

https://github.com/MaZhuang/grouplinear). We remark that using the observed data in
forming the bins may lead to loss of minimaxity of the group-linear estimator. Nevertheless,
we find it appropriate to explore such strategies when applying the estimator to real data.
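A minimal sketch of such a dynamic-programming search is given below. It is not the authors' implementation (see the repository above for that). It assumes the block estimator $\hat\theta_i = X_i - \hat b (X_i - \bar X)$ with $\hat b = \min(1, c_n^* \bar V / s_n^2)$ and $c_n^* = \{1 - 2(V_{\max}/\bar V)/(n-1)\}_+$, uses a Stein-type unbiased risk estimate for each candidate block, and takes `max_size` as the bin-size cap. The names `block_fit` and `dynamic_group_linear` are hypothetical.

```python
import math
import random

def block_fit(x, v):
    """Shrink one block toward its mean; return (estimates, unbiased risk estimate)."""
    n = len(x)
    xbar = sum(x) / n
    vbar = sum(v) / n
    s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    # Assumed constant c_n^* = {1 - 2 (V_max / V_bar) / (n - 1)}_+ .
    c = max(0.0, 1.0 - 2.0 * (max(v) / vbar) / (n - 1))
    if s2 > c * vbar:
        b = c * vbar / s2        # shrinkage factor b(s2) < 1
        db = -b / s2             # derivative b'(s2)
    else:
        b, db = (1.0, 0.0) if c > 0 else (0.0, 0.0)
    est = [xi - b * (xi - xbar) for xi in x]
    # Stein-type unbiased estimate of sum_i E(est_i - theta_i)^2 for a fixed block.
    sure = sum(
        vi + (xi - xbar) ** 2 * b * b
        - 2.0 * vi * ((1.0 - 1.0 / n) * b + 2.0 * (xi - xbar) ** 2 * db / (n - 1))
        for xi, vi in zip(x, v)
    )
    return est, sure

def dynamic_group_linear(x, v, max_size):
    """Choose contiguous bins (in order of v, sizes 2..max_size) minimizing total SURE."""
    order = sorted(range(len(x)), key=lambda i: v[i])
    xs = [x[i] for i in order]
    vs = [v[i] for i in order]
    n = len(xs)
    best = [math.inf] * (n + 1)   # best[j]: minimal SURE for the first j points
    cut = [0] * (n + 1)           # cut[j]: start index of the last bin in that optimum
    best[0] = 0.0
    for j in range(2, n + 1):
        for i in range(max(0, j - max_size), j - 1):
            if best[i] == math.inf:
                continue
            _, sure = block_fit(xs[i:j], vs[i:j])
            if best[i] + sure < best[j]:
                best[j], cut[j] = best[i] + sure, i
    if best[n] == math.inf:
        raise ValueError("no feasible partition; use max_size >= 3")
    est = [0.0] * n
    j = n
    while j > 0:                  # backtrack through the optimal cuts
        i = cut[j]
        block_est, _ = block_fit(xs[i:j], vs[i:j])
        for pos, e in zip(order[i:j], block_est):
            est[pos] = e
        j = i
    return est, best[n] / n

rng = random.Random(1)
v_demo = [0.1 + 0.05 * i for i in range(12)]
x_demo = [rng.gauss(0.0, vi ** 0.5) for vi in v_demo]
est_demo, risk_demo = dynamic_group_linear(x_demo, v_demo, max_size=12)
```

The search costs on the order of `n * max_size` block evaluations, each linear in the block size; an implementation meant for large data would update the block statistics incrementally instead of recomputing them.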

Both versions of the group-linear estimator perform well in predicting batting averages for
all players relative to the other estimators. As discussed in Brown (2008), nonconformity to
the hierarchical normal-normal model, on which most parametric empirical Bayes estimators
are based, is evident in the data. First, non-pitchers tend to have better batting
averages than pitchers, hence it is more plausible that the $\theta_i$ come from a mixture of two
distributions. Second, players with higher batting averages tend to play more, suggesting
that there is statistical dependence between the true means, $\theta_i$, and the sampling variances of
$X_{ji}$ ($\propto 1/N_{ji}$); see Figure 4 in Brown (2008). While the nonparametric MLE method handles
non-normality in the "prior" distribution of the $\theta_i$ well, its derivation still assumes statistical
independence between the true means and the sampling variances. The group-linear
estimator achieves good performance in this situation because it is able to accommodate
this dependence between the true means and the sampling variances.
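The effect of such dependence can be seen in a toy simulation. The population below is made up for illustration (it is not the baseball data): observations with small variance have mean 0 and those with large variance have mean 3. The block estimator assumes the constant $c_n^* = \{1 - 2(V_{\max}/\bar V)/(n-1)\}_+$, and the names `shrink_block` and `simulate` are hypothetical.

```python
import random

def shrink_block(x, v):
    """Spherically symmetric shrinkage of one block toward its own mean."""
    n = len(x)
    xbar = sum(x) / n
    vbar = sum(v) / n
    s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    c = max(0.0, 1.0 - 2.0 * (max(v) / vbar) / (n - 1))  # assumed c_n^*
    if s2 > c * vbar:
        b = c * vbar / s2
    else:
        b = 1.0 if c > 0 else 0.0
    return [xi - b * (xi - xbar) for xi in x]

def simulate(seed=0, m=200):
    """Return average squared losses of (naive, one block, two variance groups)."""
    rng = random.Random(seed)
    v = [0.1] * m + [1.0] * m
    theta = [0.0] * m + [3.0] * m          # true means depend on the variances
    x = [rng.gauss(t, vi ** 0.5) for t, vi in zip(theta, v)]

    def loss(est):
        return sum((e - t) ** 2 for e, t in zip(est, theta)) / len(theta)

    pooled = shrink_block(x, v)            # ignores the variances when grouping
    grouped = shrink_block(x[:m], v[:m]) + shrink_block(x[m:], v[m:])
    return loss(x), loss(pooled), loss(grouped)

naive_loss, pooled_loss, grouped_loss = simulate()
```

In this setup the two-group version shrinks each group to its own center, whereas the single block must trade off within-group noise against the gap between the group means.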

When analyzing pitchers and non-pitchers separately on the original data, the SURE
methods achieve dramatic improvement, and outperform the group-linear estimators by a
significant amount. However, the results are quite different for shuffled data. The difference is
seen most prominently for non-pitchers: when actual second half records are used, the
group-linear estimator incurs higher prediction error than the semi-parametric SURE estimator
(0.325 vs. 0.261); but the opposite emerges for shuffled data (0.175 vs. 0.219). For pitchers
only, the estimators of Xie et al. (2012) outperform the group-linear estimator in both the standard
analysis and the permutation analysis. This is reasonable, as the association between the
number of at-bats and the true ability is expected to be weaker among pitchers than among non-pitchers.
6 Conclusion and Directions for Further Investigation


For a heteroscedastic normal vector, empirical Bayes estimators that have been suggested,
both parametric and nonparametric, usually rely on a hierarchical model in which the
parameter $\theta_i$ has a prior distribution unrelated to the observed sampling variance $V_i =
\mathrm{Var}(X_i|\theta_i)$. If separable estimators are considered, representing the heteroscedastic normal
mean estimation problem as a compound decision problem reveals that this model is
generally inadequate to achieve risk reduction as compared to the naive estimator.
Group-linear methods, on the other hand, are capable of capturing dependency between $\theta_i$ and $V_i$,
and therefore are more appropriate for problems where it exists.

There is certainly room for further research. We point out a few possible directions for 
extending Theorems 1 and 2, that are outside the scope of the current work: 

1. In the i.i.d. case, the distribution of the population $(Y, \xi, A)$ may be allowed to depend
on $n$ in such a way that $r_n(a^*, b^*) \to 0$ as $n \to \infty$. In this case the criterion (17) should
be strengthened to the asymptotic ratio optimality criterion



(24)


as $n \to \infty$. As (24) does not hold uniformly for all joint distributions of $(Y, \xi, A)$,
a reasonable target would be to prove (24) when $r_n(a^*, b^*) >$ for small under
suitable side conditions on the joint distribution of $(Y, A)$. This theory should include
(17) as a special case and still maintain the property (16).

2. When $a^*(v)$ satisfies an order-$\alpha$ smoothness condition with $\alpha > 1$, a higher-order
estimate of $a^*(V_i)$ needs to be used to achieve the optimal rate in the
nonparametric regression case, $r(a^*, b^*) = 0$, e.g., $\hat a(V_i)$ with an estimated polynomial
$\hat a(v)$ for each $J_k$. We speculate that such a group-polynomial estimator might still
always outperform the naive estimator $\hat\theta_i = X_i$ under a somewhat stronger minimum
sample size requirement.


Appendix: Proofs 

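Proof of Lemma 1 It suffices to consider $0 < c_n \le 2c_n^*$. Let $b(x) = \min(1, c_n \bar V/x)$ such
that $\hat b = b(s_n^2)$. Notice that $(\partial/\partial X_i) s_n^2 = 2(X_i - \bar X)/(n-1)$. By Stein's lemma,

$$E (X_i - \theta_i)(X_i - \bar X)\hat b = V_i\, E\{(1 - 1/n)\, b(s_n^2) + 2(X_i - \bar X)^2 b'(s_n^2)/(n-1)\}.$$

By definition, $2V_i/(n-1) \le \bar V(1 - c_n^*)$ and $x b'(x) = -b(x) I\{b(x) < 1\}$, so that

$$
\begin{aligned}
\frac{1}{n}\sum_{i=1}^n E\{X_i - (X_i - \bar X)\hat b - \theta_i\}^2
&= \frac{1}{n}\sum_{i=1}^n \Big[ V_i + E(X_i - \bar X)^2 b^2(s_n^2) - 2V_i E\Big\{(1 - 1/n)\, b(s_n^2) + \frac{2(X_i - \bar X)^2 b'(s_n^2)}{n-1}\Big\} \Big] \\
&\le \bar V + (1 - 1/n) E\big\{ s_n^2 b^2(s_n^2) - 2\bar V b(s_n^2) + \bar V(1 - c_n^*)\, 2 b(s_n^2) I_{\{s_n^2 > c_n \bar V\}} \big\} \\
&= \bar V + (1 - 1/n) E\, \bar V b(s_n^2) \big\{ \min(s_n^2/\bar V, c_n) - 2 + 2(1 - c_n^*) I_{\{s_n^2 > c_n \bar V\}} \big\} \\
&= \bar V - (1 - 1/n) E\, \bar V b(s_n^2) \big\{ (2c_n^* - c_n) I_{\{s_n^2 > c_n \bar V\}} + (2 - s_n^2/\bar V) I_{\{s_n^2 \le c_n \bar V\}} \big\} \\
&= \bar V \Big[ 1 - (1 - 1/n) E\big\{ b(s_n^2)(2c_n^* - c_n) + (2 - 2c_n^* + c_n - s_n^2/\bar V) I_{\{s_n^2/\bar V \le c_n\}} \big\} \Big] \\
&\le \bar V.
\end{aligned}
$$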
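The conclusion of Lemma 1, that the normalized risk of the block estimator does not exceed the average variance $\bar V$, can be checked numerically. The sketch below is a Monte Carlo illustration only, assuming $c_n = c_n^* = \{1 - 2(V_{\max}/\bar V)/(n-1)\}_+$; the helper names and the chosen means and variances are hypothetical.

```python
import random

def block_estimate(x, v):
    """Block estimator X_i - b_hat (X_i - X_bar) with b_hat = min(1, c V_bar / s2)."""
    n = len(x)
    xbar = sum(x) / n
    vbar = sum(v) / n
    s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    c = max(0.0, 1.0 - 2.0 * (max(v) / vbar) / (n - 1))  # assumed c_n^*
    if s2 > c * vbar:
        b = c * vbar / s2
    else:
        b = 1.0 if c > 0 else 0.0
    return [xi - b * (xi - xbar) for xi in x]

def mc_risk(theta, v, reps, seed):
    """Monte Carlo estimate of (1/n) sum_i E(theta_hat_i - theta_i)^2."""
    rng = random.Random(seed)
    n = len(theta)
    sd = [vi ** 0.5 for vi in v]
    total = 0.0
    for _ in range(reps):
        x = [rng.gauss(t, s) for t, s in zip(theta, sd)]
        est = block_estimate(x, v)
        total += sum((e - t) ** 2 for e, t in zip(est, theta)) / n
    return total / reps

v_check = [0.5 + 0.1 * i for i in range(10)]   # unequal known variances
theta_check = [0.0] * 10                       # equal means: shrinkage helps most
risk_check = mc_risk(theta_check, v_check, reps=5000, seed=0)
vbar_check = sum(v_check) / len(v_check)
```

With equal means the simulated risk falls well below `vbar_check`; with widely dispersed means the gap shrinks toward zero, so a much larger number of replications would be needed to resolve it.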


Define $\epsilon_{|J|} = \max_{v_1, v_2 \in J} \{|a^*(v_1) - a^*(v_2)|, |b^*(v_1) - b^*(v_2)|\}$, $g(v) = \mathrm{Var}(\xi|A = v)$ and $h(v) =
E(\xi^2|A = v)$. Unless otherwise stated, all expectations and variances are conditional on $V$.

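Lemma 2 (Analysis within each block). Let $(X_i, \theta_i, V_i)_{i=1}^n$ be i.i.d. vectors drawn from some
population $(Y, \xi, A)$ satisfying (9) with $n \ge 2$. If $V_1, \dots, V_n \in J$ for some interval $J$ and
$\min_{1 \le i \le n} b^*(V_i) \ge \epsilon$, $b^*(\bar V) \ge \epsilon$ for some $\epsilon > 0$, then the spherically symmetric shrinkage
estimator defined in (15) with $c_n = c_n^*$ satisfies

$$
\frac{1}{n}\sum_{i=1}^n E\big[(\hat\theta_i - \theta_i)^2 \big| V\big]
\le \frac{1}{n}\sum_{i=1}^n r(a^*, b^*|V_i) + \frac{7 V_{\max}}{n-1}
+ (\bar V \epsilon_{|J|} + |J|)\frac{\epsilon^2 + 1}{\epsilon^2} + \epsilon_{|J|}^2
+ \frac{2}{n-1}\Big\{ \sum_{i=1}^n V_i^2 + 2\sum_{i=1}^n (V_i + \bar V) h(V_i) + \bar V^2 \Big\}^{1/2}
\tag{25}
$$

where $V_{\max} = \max\{V_1, \dots, V_n\}$ and $\bar V = \sum_{i=1}^n V_i/n$.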


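Proof of Lemma 2 As in the proof of Lemma 1 with $c_n = c_n^*$,

$$
\frac{1}{n}\sum_{i=1}^n E\big[(\hat\theta_i - \theta_i)^2 \big| V\big]
= \frac{1}{n}\sum_{i=1}^n E\big[\{X_i - (X_i - \bar X)\hat b - \theta_i\}^2 \big| V\big]
\le \bar V + \Big(1 - \frac{1}{n}\Big) E\, \bar V b(s_n^2) \big\{ \min(s_n^2/\bar V, c_n^*) - 2 + 2(1 - c_n^*) \big\}.
$$

By definition, $r(a^*, b^*|V_i) = V_i(1 - b^*(V_i))$ and $\min(s_n^2/\bar V, c_n^*) \le c_n^* \le 1$. Then,

$$
\frac{1}{n}\sum_{i=1}^n E\big[(\hat\theta_i - \theta_i)^2 \big| V\big]
\le \frac{1}{n}\sum_{i=1}^n r(a^*, b^*|V_i) + \frac{1}{n}\sum_{i=1}^n b^*(V_i) V_i - \Big(1 - \frac{1}{n}\Big)\bar V E(\hat b) + 2\bar V(1 - c_n^*).
$$

Observing that $0 \le \hat b \le 1$ and $\bar V(1 - c_n^*) \le 2V_{\max}/(n-1)$,

$$
\begin{aligned}
\frac{1}{n}\sum_{i=1}^n E\big[(\hat\theta_i - \theta_i)^2 \big| V\big]
&\le \frac{1}{n}\sum_{i=1}^n r(a^*, b^*|V_i) + \frac{4V_{\max}}{n-1} + \frac{\bar V}{n} + \frac{1}{n}\sum_{i=1}^n V_i b^*(V_i) - \bar V E(\hat b) \\
&\le \frac{1}{n}\sum_{i=1}^n r(a^*, b^*|V_i) + \frac{5V_{\max}}{n-1} + \bar V\Big\{ \max_{1 \le i \le n} b^*(V_i) - E(\hat b) \Big\} \\
&= \frac{1}{n}\sum_{i=1}^n r(a^*, b^*|V_i) + \frac{5V_{\max}}{n-1} + \bar V\Big\{ \max_{1 \le i \le n} b^*(V_i) - b^*(\bar V) \Big\} + \bar V\big( b^*(\bar V) - E\hat b \big) \\
&\le \frac{1}{n}\sum_{i=1}^n r(a^*, b^*|V_i) + \frac{5V_{\max}}{n-1} + \bar V \epsilon_{|J|} + \bar V\big( b^*(\bar V) - E\hat b \big),
\end{aligned}
$$

where the last inequality is due to the uniform continuity of $b^*(v)$. Next we will bound
$\bar V(b^*(\bar V) - E\hat b)$. By definition, $\bar V(b^*(\bar V) - E\hat b) = \bar V E\{\bar V/\mathrm{Var}(Y|A = \bar V) - \min(1, c_n^* \bar V/s_n^2)\}$.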
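Further observe that $\bar V/\mathrm{Var}(Y|A = \bar V) = \bar V/(\bar V + \mathrm{Var}(\xi|A = \bar V)) \le 1$, so that

$$
\begin{aligned}
\bar V\big( b^*(\bar V) - E\hat b \big)
&\le \bar V E\big\{ \big( \bar V/\mathrm{Var}(Y|A = \bar V) - c_n^* \bar V/s_n^2 \big)_+ \big\} \\
&\le \bar V E\big\{ \big( 1 - c_n^* \mathrm{Var}(Y|A = \bar V)/s_n^2 \big)_+ \big\} \\
&= E\, \bar V \big\{ (1 - c_n^*) + c_n^* \big( 1 - \mathrm{Var}(Y|A = \bar V)/s_n^2 \big) \big\}_+ .
\end{aligned}
$$

Also, noting that $1 - c_n^* \ge 0$ and $c_n^* \le 1$,

$$
\begin{aligned}
\bar V\big( b^*(\bar V) - E\hat b \big)
&\le \bar V(1 - c_n^*) + E\big| s_n^2 - \mathrm{Var}(Y|A = \bar V) \big| \\
&\le 2V_{\max}/(n-1) + E\big| s_n^2 - E s_n^2 \big| + \big| E s_n^2 - \mathrm{Var}(Y|A = \bar V) \big| \\
&= 2V_{\max}/(n-1) + E\big\{ E_\theta \big| s_n^2 - E s_n^2 \big| \big\} + \big| E s_n^2 - \mathrm{Var}(Y|A = \bar V) \big| \\
&\le 2V_{\max}/(n-1) + E\sqrt{\mathrm{Var}(s_n^2|\theta)} + \big| E s_n^2 - \mathrm{Var}(Y|A = \bar V) \big| \\
&\le 2V_{\max}/(n-1) + \big\{ E[\mathrm{Var}(s_n^2|\theta)] \big\}^{1/2} + \big| E s_n^2 - \mathrm{Var}(Y|A = \bar V) \big|,
\end{aligned}
$$

where the last two inequalities are due to Jensen's inequality. Conditionally on $V =
(V_1, \dots, V_n)$ and $\theta = (\theta_1, \dots, \theta_n)$, $\bar X \sim N(\sum_{i=1}^n \theta_i/n, \sum_{i=1}^n V_i/n^2)$, and therefore

$$
\begin{aligned}
E s_n^2 &= \frac{1}{n-1} E\Big\{ \sum_{i=1}^n (V_i + \theta_i^2) - n\Big( \frac{\bar V}{n} + \bar\theta^2 \Big) \Big\} \\
&= \bar V + \frac{1}{n(n-1)} \Big\{ (n-1)\sum_{i=1}^n E(\xi^2|A = V_i) - \sum_{j \ne k} E(\xi|A = V_j) E(\xi|A = V_k) \Big\} \qquad (26) \\
&= \bar V + \frac{1}{n}\sum_{i=1}^n \mathrm{Var}(\xi|A = V_i) + \frac{1}{n-1}\sum_{i=1}^n \Big[ E(\xi|A = V_i) - \frac{1}{n}\sum_{j=1}^n E(\xi|A = V_j) \Big]^2 .
\end{aligned}
$$

On the other hand, $\mathrm{Var}(Y|A = \bar V) = \bar V + \mathrm{Var}(\xi|A = \bar V) = \bar V + g(\bar V)$. Hence,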
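$$
\big| E(s_n^2) - \mathrm{Var}(Y|A = \bar V) \big| \le \frac{1}{n}\sum_{i=1}^n \big| g(V_i) - g(\bar V) \big| + \frac{1}{n-1}\sum_{i=1}^n \Big[ a^*(V_i) - \frac{1}{n}\sum_{j=1}^n a^*(V_j) \Big]^2 .
$$

The uniform continuity of $a^*(v)$ implies that $|a^*(V_i) - \sum_{j=1}^n a^*(V_j)/n| \le (n-1)/n\, \epsilon_{|J|}$. By
definition, $b^*(v) = v/(v + g(v))$, then $g(v) = v/b^*(v) - v$ and therefore

$$
\begin{aligned}
\big| g(V_i) - g(\bar V) \big|
&= \Big| \frac{V_i b^*(\bar V) - \bar V b^*(V_i)}{b^*(V_i) b^*(\bar V)} - (V_i - \bar V) \Big| \\
&\le \frac{\big| V_i [ b^*(\bar V) - b^*(V_i) ] \big|}{b^*(V_i) b^*(\bar V)} + \frac{\big| (V_i - \bar V) b^*(V_i) \big|}{b^*(V_i) b^*(\bar V)} + |V_i - \bar V| \\
&\le \frac{V_i \epsilon_{|J|} + |J|}{\epsilon^2} + |J|,
\end{aligned}
$$

where the last inequality follows from $\min_{1 \le i \le n} b^*(V_i) \ge \epsilon$, $b^*(\bar V) \ge \epsilon$. Combining the two
inequalities above, $|E(s_n^2) - \mathrm{Var}(Y|A = \bar V)| \le (\bar V \epsilon_{|J|} + |J|)/\epsilon^2 + |J| + \epsilon_{|J|}^2$. Finally, we are
going to control $E\{\mathrm{Var}(s_n^2|\theta)\}$. Again, $\bar X | V, \theta \sim N(\sum_{i=1}^n \theta_i/n, \sum_{i=1}^n V_i/n^2)$, hence

$$
\begin{aligned}
E\{\mathrm{Var}(s_n^2|\theta)\}
&= \frac{1}{(n-1)^2} E\Big\{ \mathrm{Var}\Big( \sum_{i=1}^n X_i^2 - n\bar X^2 \Big| \theta \Big) \Big\} \\
&\le \frac{2}{(n-1)^2} E\Big\{ \mathrm{Var}\Big( \sum_{i=1}^n X_i^2 \Big| \theta \Big) + \mathrm{Var}\big( n\bar X^2 \big| \theta \big) \Big\} \\
&= \frac{2}{(n-1)^2} E\Big\{ \sum_{i=1}^n (2V_i^2 + 4\theta_i^2 V_i) + n^2 \big( 2\bar V^2/n^2 + 4\bar\theta^2 \bar V/n \big) \Big\}.
\end{aligned}
$$

By definition, $h(v) = E(\xi^2|A = v)$, and, noting that $n\bar\theta^2 \le \sum_{i=1}^n \theta_i^2$,

$$
\begin{aligned}
E\{\mathrm{Var}(s_n^2|\theta)\}
&\le \frac{4}{(n-1)^2} \Big\{ \sum_{i=1}^n V_i^2 + 2\sum_{i=1}^n V_i h(V_i) + \bar V^2 + 2\bar V \sum_{i=1}^n h(V_i) \Big\} \\
&= \frac{4}{(n-1)^2} \Big\{ \sum_{i=1}^n V_i^2 + 2\sum_{i=1}^n (V_i + \bar V) h(V_i) + \bar V^2 \Big\}.
\end{aligned}
$$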
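The variance terms entering the bound on $E\{\mathrm{Var}(s_n^2|\theta)\}$ come from the second and fourth moments of a normal variable. The short derivation below is a worked check of that step; the generic symbols $\mu$, $\sigma^2$ and $Z$ are introduced here only for illustration.

```latex
% For W ~ N(mu, sigma^2), write W = mu + sigma Z with Z ~ N(0, 1).
\begin{align*}
E W^2 &= \mu^2 + \sigma^2, \\
E W^4 &= \mu^4 + 6\mu^2\sigma^2 + 3\sigma^4, \\
\mathrm{Var}(W^2) &= E W^4 - (E W^2)^2 = 2\sigma^4 + 4\mu^2\sigma^2 .
\end{align*}
% With (mu, sigma^2) = (theta_i, V_i):
%   Var(X_i^2 | theta) = 2 V_i^2 + 4 theta_i^2 V_i .
% With (mu, sigma^2) = (bar theta, bar V / n):
%   Var(bar X^2 | theta) = 2 bar V^2 / n^2 + 4 bar theta^2 bar V / n .
```

The factor 2 in front of the two variances follows from $\mathrm{Var}(A - B) \le 2\mathrm{Var}(A) + 2\mathrm{Var}(B)$.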

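Putting the pieces together, we have

$$
\bar V\big( b^*(\bar V) - E\hat b \big) \le \frac{2V_{\max}}{n-1} + \frac{\bar V \epsilon_{|J|} + |J|}{\epsilon^2} + |J| + \epsilon_{|J|}^2 + \frac{2}{n-1}\Big\{ \sum_{i=1}^n V_i^2 + 2\sum_{i=1}^n (V_i + \bar V) h(V_i) + \bar V^2 \Big\}^{1/2},
$$

and therefore

$$
\frac{1}{n}\sum_{i=1}^n E\big[(\hat\theta_i - \theta_i)^2 \big| V\big]
\le \frac{1}{n}\sum_{i=1}^n r(a^*, b^*|V_i) + \frac{7V_{\max}}{n-1}
+ (\bar V \epsilon_{|J|} + |J|)\frac{\epsilon^2 + 1}{\epsilon^2} + \epsilon_{|J|}^2
+ \frac{2}{n-1}\Big\{ \sum_{i=1}^n V_i^2 + 2\sum_{i=1}^n (V_i + \bar V) h(V_i) + \bar V^2 \Big\}^{1/2}.
$$
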

i=l i=l 

Proof of Theorem 1. The first part of the Theorem follows from Lemma 1. For the second
part, it suffices to show that for all $\epsilon > 0$, the excess risk is $O(\epsilon)$. Notice that the
contribution to the normalized risk for observations outside $\cup_{k=1}^m J_k$ is
$o(1)$, we only need to consider the case where $\forall\, 1 \le i \le n$, $V_i \in \cup_{k=1}^m J_k$. Without loss of
generality, we assume $\forall\, 1 \le k \le m$, either $J_k \subset [0, \epsilon)$ or $J_k \subset (\epsilon, +\infty)$ since we can always
reduce $\epsilon$ such that this happens. Due to the assumption that $\limsup_{n \to \infty} \sum_{i=1}^n V_i/n < \infty$,
we can also choose $M_\epsilon$ large enough such that $\sum_{i=1}^n V_i I\{V_i > M_\epsilon\}/n \le \epsilon$ and for any $k$ with
$J_k \subset (\epsilon, +\infty)$, either $J_k \subset (\epsilon, M_\epsilon)$ or $J_k \subset (M_\epsilon, +\infty)$.

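For the rest of the proof, we divide all the observations into four disjoint groups and
handle them separately. Let $\bar V_k = \sum_{i \in \mathcal{I}_k} V_i/n_k$ and define $S_1 = \{k \,|\, 1 \le k \le n, J_k \subset
(0, \epsilon)\}$, $S_2 = \{k \,|\, 1 \le k \le n, J_k \subset (\epsilon, M_\epsilon), \min_{i \in \mathcal{I}_k} b^*(V_i) \ge \epsilon, b^*(\bar V_k) \ge \epsilon\}$, $S_3 = \{k \,|\, 1 \le k \le
n, J_k \subset (\epsilon, M_\epsilon), \min_{i \in \mathcal{I}_k} b^*(V_i) < \epsilon \text{ or } b^*(\bar V_k) < \epsilon\}$, $S_4 = \{k \,|\, 1 \le k \le n, J_k \subset (M_\epsilon, +\infty)\}$.

Case i) For the small variance part, $V_i \in (0, \epsilon)$, the contribution to the risk is negligible.
Because the group-linear shrinkage estimator dominates the MLE in each interval,

$$
\frac{1}{n}\sum_{k \in S_1}\sum_{i \in \mathcal{I}_k} E\big[(\hat\theta_i - \theta_i)^2 \big| V\big]
\le \frac{1}{n}\sum_{k \in S_1}\sum_{i \in \mathcal{I}_k} V_i
\le \frac{1}{n}\sum_{k \in S_1}\sum_{i \in \mathcal{I}_k} \epsilon \le \epsilon .
$$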


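Case ii) For moderate variance with large shrinkage factor, $V_i \in (\epsilon, M_\epsilon)$ and $b^*(V_i), b^*(\bar V_k) \ge
\epsilon$, shrinkage is necessary to mimic the oracle. Applying Lemma 2 to each interval $J_k$, $k \in S_2$,

$$
\begin{aligned}
\frac{1}{n}\sum_{k \in S_2}\sum_{i \in \mathcal{I}_k} E\big[(\hat\theta_i - \theta_i)^2 \big| V\big]
\le{}& \frac{1}{n}\sum_{k \in S_2}\sum_{i \in \mathcal{I}_k} r(a^*, b^*|V_i)
+ \frac{1}{n}\sum_{k \in S_2} n_k \Big\{ \frac{7\max_{i \in \mathcal{I}_k} V_i}{n_k - 1}
+ (\bar V_k \epsilon_{|J_k|} + |J_k|)\frac{\epsilon^2 + 1}{\epsilon^2} + \epsilon_{|J_k|}^2 \\
&+ \frac{2}{n_k - 1}\Big( \sum_{i \in \mathcal{I}_k} V_i^2 + 2\sum_{i \in \mathcal{I}_k} (V_i + \bar V_k) h(V_i) + \bar V_k^2 \Big)^{1/2} \Big\}.
\end{aligned}
$$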
Let $|J|_{\max} = \max_{1 \le k \le m} |J_k|$ and $\epsilon_{\max} = \max_{1 \le k \le m} \epsilon_{|J_k|}$. Using the fact that $\max_{1 \le k \le m} n_k/(n_k - 1) \le 2$,


^ - '’•) V] - ^ E EE 

keS 2 i&Ik k&S 2 ielfe k&S 2 

+ nk{V Cmax + I'^lmax)—-1" 4 ^ + 2 (U + U )h{Vi) + {V 

_^ 

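For any $k \in S_2$, $i \in \mathcal{I}_k$, $\bar V_k, V_i \le M_\epsilon$. Because $a^*(v)$ is uniformly continuous on $[0, M_\epsilon]$, there
exists a constant $C_\epsilon$ only depending on $\epsilon$ such that $a^*(V_i) \le C_\epsilon$. Then,


$$h(V_i) = \mathrm{Var}(\xi|A = V_i) + \big( E(\xi|A = V_i) \big)^2 \le V_i/b^*(V_i) - V_i + (a^*(V_i))^2 \le M_\epsilon/\epsilon + C_\epsilon^2$$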


^ 5^r(a*,6*|U) + + |JUax) +e^ 


fceS2 i&Xk 


k£S2 ielk 


+ (^eCmax + |X|niax)-^-1- a/ 2M|(1+£ ^) + 2M£C'e 


n 


By the Cauchy-Schwarz inequality, $\sum_{k \in S_2} \sqrt{n_k} \le \sqrt{|S_2| \sum_{k \in S_2} n_k} \le \sqrt{|S_2|\, n}$. Further
observe that $|S_2| \le m \le n/\min_{1 \le k \le m} n_k$, then


V V E [( 0 , - vl < - V V r{a*, b*\Vi) + — -(m, + | JUax) + 

l\ / in mm nt. V / 

kGS2 i^'^k kGS2 i^'^k l</c<m 


£2 + 1 


+ (M, 

^max “1“ II max )~ji —^ 


/ mm Uk 

l<k<m 


/2M2(1 + £-i) + 2M,C', 


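Since $|J|_{\max}, \epsilon_{\max} \to 0$ and $\min_{1 \le k \le m} n_k \to +\infty$, we obtain

$$
\frac{1}{n}\sum_{k \in S_2}\sum_{i \in \mathcal{I}_k} E\big[(\hat\theta_i - \theta_i)^2 \big| V\big] \le \frac{1}{n}\sum_{k \in S_2}\sum_{i \in \mathcal{I}_k} r(a^*, b^*|V_i) + o(\epsilon).
$$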

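Case iii) For moderate variance with negligible shrinkage factor, $V_i \in (\epsilon, M_\epsilon)$ and
$\min_{i \in \mathcal{I}_k} b^*(V_i)$ or $b^*(\bar V_k) < \epsilon$. The uniform continuity of $b^*(\cdot)$ implies that $\forall\, i \in \mathcal{I}_k$, $b^*(V_i) \le
\epsilon + \epsilon_{\max}$. By definition $r(a^*, b^*|V_i) = V_i(1 - b^*(V_i))$, then

$$
\frac{1}{n}\sum_{k \in S_3}\sum_{i \in \mathcal{I}_k} r(a^*, b^*|V_i) = \frac{1}{n}\sum_{k \in S_3}\sum_{i \in \mathcal{I}_k} V_i\big( 1 - b^*(V_i) \big) \ge \frac{1}{n}\sum_{k \in S_3}\sum_{i \in \mathcal{I}_k} V_i\big( 1 - (\epsilon + \epsilon_{\max}) \big).
$$

Since the proposed group-linear shrinkage estimator dominates the MLE in each block,

$$
\frac{1}{n}\sum_{k \in S_3}\sum_{i \in \mathcal{I}_k} E\big[(\hat\theta_i - \theta_i)^2 \big| V\big] \le \frac{1}{n}\sum_{k \in S_3}\sum_{i \in \mathcal{I}_k} r(a^*, b^*|V_i) + \bar V(\epsilon + \epsilon_{\max}).
$$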

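Case iv) For the large variance part, $V_i \in (M_\epsilon, +\infty)$, by the definition of $M_\epsilon$,

$$
\frac{1}{n}\sum_{k \in S_4}\sum_{i \in \mathcal{I}_k} E\big[(\hat\theta_i - \theta_i)^2 \big| V\big] \le \frac{1}{n}\sum_{i=1}^n V_i I\{V_i > M_\epsilon\} \le \epsilon.
$$

Summing up the inequalities of all four cases,

$$
\frac{1}{n}\sum_{i=1}^n E\big[(\hat\theta_i - \theta_i)^2 \big| V\big] \le \frac{1}{n}\sum_{i=1}^n r(a^*, b^*|V_i) + (\bar V + 2)\epsilon + o(\epsilon),
$$

which completes the proof by the assumption that $\limsup_{n \to \infty} \sum_{i=1}^n V_i/n < \infty$. □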

Lemma 3 (Analysis within each block). Let $(X_i, \theta_i, V_i)_{i=1}^n$ be i.i.d. vectors from some
population $(Y, \xi, A)$ satisfying (9). If $r(a^*, b^*) = 0$, $a^*(\cdot)$ is $L$-Lipschitz continuous and
$V_1, \dots, V_n \in J$ for some interval $J$, then the estimator defined in (15) with $c_n = c_n^*$ satisfies

$$
\frac{1}{n}\sum_{i=1}^n E\big[(\hat\theta_i - \theta_i)^2 \big| V\big] \le L|J|^2 + 2\bar V/n + 4V_{\max}/(n \vee 2 - 1).
$$


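Proof of Lemma 3 As in the proof of Lemma 1, substituting $c_n$ with $c_n^*$,

$$
\begin{aligned}
\frac{1}{n}\sum_{i=1}^n E\big[(\hat\theta_i - \theta_i)^2 \big| V\big]
&= \frac{1}{n}\sum_{i=1}^n E\big[\{X_i - (X_i - \bar X)\hat b - \theta_i\}^2 \big| V\big] \\
&\le \bar V\Big[ 1 - (1 - 1/n) E\big\{ \hat b(2c_n^* - c_n) + (2 - 2c_n^* + c_n - s_n^2/\bar V) I_{\{s_n^2/\bar V < c_n\}} \big\} \Big] \\
&= \bar V\Big[ 1 - (1 - 1/n) E\big\{ \hat b c_n^* + (2 - c_n^* - s_n^2/\bar V) I_{\{s_n^2/\bar V < c_n^*\}} \big\} \Big] \\
&= \bar V E\big\{ (1 - \hat b c_n^*) - (2 - 2c_n^*) I_{\{s_n^2/\bar V < c_n^*\}} - (c_n^* - s_n^2/\bar V) I_{\{s_n^2/\bar V < c_n^*\}} \big\} \\
&\quad + \frac{\bar V}{n} E\big\{ \hat b c_n^* + (2 - c_n^* - s_n^2/\bar V) I_{\{s_n^2/\bar V < c_n^*\}} \big\}.
\end{aligned}
$$

Notice that $2 - 2c_n^* \ge 0$ and $\hat b c_n^* + (2 - c_n^* - s_n^2/\bar V) I_{\{s_n^2/\bar V < c_n^*\}} \le 2$. Hence

$$
\begin{aligned}
\frac{1}{n}\sum_{i=1}^n E\big[(\hat\theta_i - \theta_i)^2 \big| V\big]
&\le \bar V E\big\{ (1 - \hat b c_n^*) - (c_n^* - s_n^2/\bar V) I_{\{s_n^2/\bar V < c_n^*\}} \big\} + 2\bar V/n \\
&\le \bar V E\big\{ c_n^*(1 - \hat b) - (c_n^* - s_n^2/\bar V) I_{\{s_n^2/\bar V < c_n^*\}} \big\} + 2\bar V/n + (1 - c_n^*)\bar V \\
&= E\big\{ c_n^* \bar V (1 - c_n^* \bar V/s_n^2)_+ - (c_n^* \bar V - s_n^2)_+ \big\} + 2\bar V/n + (1 - c_n^*)\bar V \\
&\le E\big\{ (s_n^2 - c_n^* \bar V)_+ - (c_n^* \bar V - s_n^2)_+ \big\} + 2\bar V/n + (1 - c_n^*)\bar V \\
&= E(s_n^2 - c_n^* \bar V) + 2\bar V/n + (1 - c_n^*)\bar V.
\end{aligned}
$$

Recall that $E s_n^2 = \bar V + \frac{1}{n}\sum_{i=1}^n \mathrm{Var}(\xi|A = V_i) + \frac{1}{n-1}\sum_{i=1}^n [E(\xi|A = V_i) - \frac{1}{n}\sum_{j=1}^n E(\xi|A = V_j)]^2$.
With $\mathrm{Var}(\xi|A = v) = 0$, we have $E s_n^2 = \bar V + \frac{1}{n-1}\sum_{i=1}^n [a^*(V_i) - \frac{1}{n}\sum_{j=1}^n a^*(V_j)]^2$ and

$$
\begin{aligned}
\frac{1}{n}\sum_{i=1}^n E\big[(\hat\theta_i - \theta_i)^2 \big| V\big]
&\le 2(1 - c_n^*)\bar V + \frac{1}{n-1}\sum_{i=1}^n \Big[ a^*(V_i) - \frac{1}{n}\sum_{j=1}^n a^*(V_j) \Big]^2 + 2\bar V/n \\
&\le L|J|^2 + 2\bar V/n + 2(1 - c_n^*)\bar V \le L|J|^2 + 2\bar V/n + \frac{4V_{\max}}{n \vee 2 - 1}.
\end{aligned}
$$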


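Proof of Theorem 2. Apply Lemma 3 to each interval and notice $n_k/(n_k \vee 2 - 1) \le 2$,

$$
\frac{1}{n}\sum_{i=1}^n E\big[(\hat\theta_i - \theta_i)^2 \big| V\big]
\le \frac{1}{n}\sum_{k=1}^m \Big\{ n_k L|J_k|^2 + 2\bar V_k + 4V_{\max}\frac{n_k}{n_k \vee 2 - 1} \Big\}
\le L|J|^2 + 10 m V_{\max}/n = L|J|^2 + 10 V_{\max}^2/(n|J|).
$$

Letting $|J| = \big(10 V_{\max}^2/(nL)\big)^{1/3}$, we have that

$$
\frac{1}{n}\sum_{i=1}^n E\big[(\hat\theta_i - \theta_i)^2 \big| V\big] \le 2\Big( \frac{10\sqrt{L}\, V_{\max}^2}{n} \Big)^{2/3}.
$$

□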
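The choice of bin width in the last step balances a squared-bias term against a variance term. The generic calculation below is a sketch under the assumption that the bound has the form $L\delta^2 + C/(n\delta)$, with $\delta$ standing for $|J|$ and $C$ for a constant not depending on $\delta$; both symbols are introduced here for illustration.

```latex
% Bound as a function of the bin width delta > 0:
%   f(delta) = L delta^2 + C / (n delta).
\begin{align*}
\text{Equalizing the two terms: } & L\delta^2 = \frac{C}{n\delta}
  \;\Longrightarrow\; \delta = \Big(\frac{C}{nL}\Big)^{1/3},
  \qquad f(\delta) = 2L\delta^2 = 2\Big(\frac{C\sqrt{L}}{n}\Big)^{2/3}. \\
\text{Exact minimizer: } & f'(\delta) = 2L\delta - \frac{C}{n\delta^2} = 0
  \;\Longrightarrow\; \delta = \Big(\frac{C}{2nL}\Big)^{1/3},
  \qquad f(\delta) = 3L\delta^2 = \frac{3}{2^{2/3}}\Big(\frac{C\sqrt{L}}{n}\Big)^{2/3}.
\end{align*}
```

Both choices give the rate $n^{-2/3}$; equalizing the terms yields the cleaner constant 2, at the cost of a factor of roughly $2/1.89$ relative to the exact minimum.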